#Area Crosswalks
area_codes <- readRDS(here("State Data/area_crosswalk.RDS")) %>%
mutate(area_title = str_remove(area_title, " -- Statewide")) %>%
mutate(area_fips = case_when(
nchar(area_fips) == 4 ~ paste("0", area_fips, sep = ""),
TRUE ~ area_fips
))
states <- data.frame(state_abbr = state.abb, area_title = state.name)
area_codes <- left_join(area_codes, states) %>%
mutate(state_abbr = case_when(
area_title == "District of Columbia" ~ "DC",
TRUE ~ state_abbr))
## Joining with `by = join_by(area_title)`
#Map Data
county_sf <- counties(cb = TRUE) %>%
shift_geometry(position = "outside")
states_sf <- states(cb = TRUE, resolution = "20m") %>%
shift_geometry(position = "outside")
#Change projection of data for leaflet
states_leaflet <- states_sf %>%
sf::st_transform('+proj=longlat +datum=WGS84')
counties_leaflet <- county_sf %>%
sf::st_transform('+proj=longlat +datum=WGS84')
#Convert county data to a table
county_data <- counties_leaflet %>%
as_tibble() %>%
select(STATEFP, COUNTYFP) %>%
mutate(area_fips = paste(STATEFP, COUNTYFP, sep = ""),
st = as.numeric(STATEFP))
#Get center of each county
county_centers <- counties_leaflet %>%
filter(str_detect(NAME, "Mariana", negate = TRUE)) %>%
st_centroid() %>%
sf::st_transform('+proj=longlat +datum=WGS84')
#Load Existing Data
qcew_3digits <- readRDS(here("State Data/qcew_3digit.RDS")) %>%
mutate(industry_desc = substring(industry_title, 11))
#Simplify to get industry codes
ind_3digit <- qcew_3digits %>%
select(naics_3digit = ind_code, naics_3digit_label = industry_desc) %>%
distinct() %>%
arrange(naics_3digit_label)
#Define colors
emp_ind_vector <- c(brewer.pal(9, "Greys")[4], "#cf4633", "#BEAED4", "#FDC086", "#FDBF6F", "#386CB0", "#FB8072", brewer.pal(9, "Greys")[3], brewer.pal(9, "Greys")[8], "#80B1D3", "#F0027F", "#4DAF4A", "#F1E2CC", "#6A3D9A", "#E78AC3", "#CBD5E8", "#666666", brewer.pal(9, "Greys")[5], brewer.pal(9, "Greys")[6], "#A65628", brewer.pal(9, "Greys")[7])
This document cleans and summarizes input-output use and supply tables.
#read data
cbp_2019 <- read.csv(here("County Data/5_digit_naics_2019.csv")) %>%
mutate(area_fips = str_sub(GEO_ID, -5)) %>%
left_join(county_data) %>%
left_join(area_codes %>% select(st, state_abbr)) %>%
mutate(emp_num = as.numeric(EMP))
use_table_raw <- read.csv(here("Input Output Data/USE_TABLE_2017/2017-Table 1.csv"))
We create a crosswalk between industry codes and their descriptions.
use_cols <- colnames(use_table_raw)
ind_codes <- use_table_raw %>%
head(1) %>%
pivot_longer(-c(1,2), names_to = "industry_desc", values_to = "industry_code") %>%
select(industry_code, industry_desc)
The first step will be to convert this data into a long data format.
use_table_long <- use_table_raw[-1,] %>%
pivot_longer(-c(1,2), names_to = "industry_desc", values_to = "value", values_drop_na = TRUE) %>%
mutate(value_clean = str_remove_all(value, "[[:punct:]]"),
value_num = as.numeric(value_clean)) %>%
filter(!is.na(value_num)) %>%
rename(commodity_code = 1, commodity_desc = 2) %>%
left_join(ind_codes)
## Joining with `by = join_by(industry_desc)`
The resulting dataframe details how individuals industries use specific commodities. E.G the DEMAND for specific INPUTS from industires. We add a few additional descriptive columns to be able to segment and sort our data. Values in this data are then the total amount in millions of dollars of specific inputs used by specific industries.
use_table_clean <- use_table_long %>%
mutate(com_code_3 = as.numeric(str_sub(commodity_code, 1, 3)),
naics_3digit = as.numeric(str_sub(industry_code, 1, 3))) %>%
left_join(ind_3digit)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `com_code_3 = as.numeric(str_sub(commodity_code, 1, 3))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
## Joining with `by = join_by(naics_3digit)`
We note that some of the commodity codes refer to aggregate measures. The same is true for some of the industry codes.
agg_coms <- use_table_clean %>% select(commodity_code, commodity_desc, com_code_3) %>% distinct() %>% filter(is.na(com_code_3))
agg_coms <- agg_coms[-c(1:2),]
agg_coms %>% select(1:2)
## # A tibble: 20 × 2
## commodity_code commodity_desc
## <chr> <chr>
## 1 S00500 Federal general government (defense)
## 2 S00600 Federal general government (nondefense)
## 3 S00102 Other federal government enterprises
## 4 GSLGE State and local government (educational services)
## 5 GSLGH State and local government (hospitals and health services)
## 6 GSLGO State and local government (other services)
## 7 S00203 Other state and local government enterprises
## 8 S00401 Scrap
## 9 S00402 Used and secondhand goods
## 10 S00300 Noncomparable imports
## 11 S00900 Rest of the world adjustment
## 12 T005 Total intermediate inputs
## 13 V00100 Compensation of employees
## 14 T00OTOP Other taxes on production
## 15 V00300 Gross operating surplus
## 16 VABAS Value added (basic value)
## 17 T018 Total industry output (basic value)
## 18 T00TOP Plus: Taxes on products and imports
## 19 T00SUB Less: Subsidies
## 20 VAPRO Value added (producer value)
These codes are useful in informing aggregate measures, such as total use of inputs, total use of imports. In addition, the USE table contains valuable information about total value added by industry. Total commodity output and total industry output are both found in the USE table.
We see that the use table provides a wide variety of measures about both commodity useage, as well as industry output.
agg_ind <- use_table_clean %>% select(industry_code, industry_desc, naics_3digit) %>% distinct() %>% filter(is.na(naics_3digit))
agg_ind <- agg_ind[-c(10),]
agg_ind %>% select(1:2)
## # A tibble: 33 × 2
## industry_code industry_desc
## <chr> <chr>
## 1 S00600 Federal.general.government..nondefense.
## 2 T001 Total.Intermediate
## 3 F03000 Change.in.private.inventories
## 4 F04000 Exports.of.goods.and.services
## 5 T019 Total.use.of.products
## 6 GSLGE State.and.local.government..educational.services.
## 7 GSLGH State.and.local.government..hospitals.and.health.services.
## 8 GSLGO State.and.local.government..other.services.
## 9 F01000 Personal.consumption.expenditures
## 10 4B0000 All.other.retail
## # ℹ 23 more rows
Because of the way that these databases are constructred, the agg_ind dataframe above refers to commodity level summaries, while the agg_com dataframe above refers to industry level summaries. We spend some time at the moment to create some commodity and industry level benchmarks.
agg_coms %>%
filter(commodity_code %in% c("T005", "S00300", "VABAS", "T018", "VAPRO")) %>%
select(1:2)
## # A tibble: 5 × 2
## commodity_code commodity_desc
## <chr> <chr>
## 1 S00300 Noncomparable imports
## 2 T005 Total intermediate inputs
## 3 VABAS Value added (basic value)
## 4 T018 Total industry output (basic value)
## 5 VAPRO Value added (producer value)
We begin with creating benchmarks for value added, total output by industry, use of imports, and use of intermediary inputs.
ind_agg <- use_table_clean %>%
filter(commodity_code %in% c("T005", "S00300", "VABAS", "T018", "VAPRO")) %>%
select(commodity_code, commodity_desc, industry_code, industry_desc, value_num) %>%
pivot_wider(id_cols = c(industry_code, industry_desc), names_from = commodity_code, values_from = value_num)
colnames(ind_agg) <- c("industry_code", "industry_desc", "imports", "intermediary_inputs", "value_added", "industry_output", "prod_value_added")
This dataframe contains information, aggregated at the INDUSTRY level, about the total of use of inputs by industries, AS WELL as the overall OUTPUT and VALUE ADDED of each industry.
We can immediately produce some summary statistics about overall industry activity here. Again, this work focuses specifically on the manufacturing sector.
manf_ind <- ind_agg %>%
mutate(naics_3digit = as.numeric(str_sub(industry_code, 1, 3))) %>%
left_join(ind_3digit) %>%
filter(naics_3digit >= 300 & naics_3digit < 400) %>%
#FIX MISSING VALUES WITH 0
mutate(across(.cols = c(3:7), ~ case_when(
is.na(.) ~ 0,
TRUE ~ .)),
log_ind = log(industry_output),
log_value = log(value_added),
log_inputs = log(intermediary_inputs),
val_ratio = value_added/industry_output,
input_ratio = intermediary_inputs/industry_output,
log_imports = log(imports),
import_ratio = imports/intermediary_inputs)
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `naics_3digit = as.numeric(str_sub(industry_code, 1, 3))`.
## Caused by warning:
## ! NAs introduced by coercion
## Joining with `by = join_by(naics_3digit)`
ind_pal <- emp_ind_vector[-c(1, 17, 9)]
ind_output <- manf_ind %>%
filter(industry_output != 0) %>%
ggplot() +
geom_density_ridges(aes(x = industry_output, y = naics_3digit_label, fill = naics_3digit_label)) +
scale_fill_manual(values = ind_pal) +
scale_x_continuous(labels = scales::comma) +
guides(fill = "none") +
labs(x = "2017 Total Industry Output (Millions of $)", y = "3 Digit NAICS Group") +
theme_bw() +
axis_theme
ind_output
## Picking joint bandwidth of 9620
ind_output_log <- manf_ind %>%
filter(industry_output != 0) %>%
ggplot() +
geom_density_ridges(aes(x = log_ind, y = naics_3digit_label, fill = naics_3digit_label)) +
scale_fill_manual(values = ind_pal) +
scale_x_continuous(trans = scales::log_trans()) +
guides(fill = "none") +
labs(x = "Log of 2017 Total Industry Output (Millions of $)", y = "") +
theme_bw() +
axis_theme
ind_output_log
## Picking joint bandwidth of 0.0416
ind_output_comp <- ind_output + ind_output_log
ind_output_comp
## Picking joint bandwidth of 9620
## Picking joint bandwidth of 0.0416
We might also be interested in the distribution of different measures of value added across manufacturing sectors. This distribution is close to the total output distribution, but has slightly flatter peaks, and some shifted distributions.
ind_value_add <- manf_ind %>%
ggplot() +
geom_density_ridges(aes(x = log_value, y = naics_3digit_label, fill = naics_3digit_label)) +
scale_fill_manual(values = ind_pal) +
scale_x_continuous(trans = scales::log_trans()) +
guides(fill = "none") +
labs(x = "Log of 2017 Total Industry Output (Millions of $)", y = "") +
theme_bw() +
axis_theme
ind_value_add
## Picking joint bandwidth of 0.0464
ind_value_add +
geom_density_ridges(aes(x = log_ind, y = naics_3digit_label, fill = naics_3digit_label), alpha = 0.5) +
scale_x_continuous(breaks = c(0,6, 8, 10, 12, 14, 16), labels = c(0,6, 8, 10, 12, 14, 16))
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
## Picking joint bandwidth of 0.398
##
## Picking joint bandwidth of 0.397
However, when we examine the ratio of value added per industry output, we see a very different distribution across manufacturing sub-sectors, with certain sub-sectors housing industries with much higher value added.
value_add_ratio <- manf_ind %>%
ggplot() +
geom_density_ridges(aes(x = val_ratio, y = naics_3digit_label, fill = naics_3digit_label)) +
scale_fill_manual(values = ind_pal) +
guides(fill = "none") +
labs(x = "Value Added Per Output, 2017", y = "") +
theme_bw() +
axis_theme
value_add_ratio
## Picking joint bandwidth of 0.0311
We see that the distribution of use of inputs follows the distribution of total output very closely.
input_sum <- manf_ind %>%
ggplot() +
geom_density_ridges(aes(x = log_inputs, y = naics_3digit_label, fill = naics_3digit_label)) +
scale_fill_manual(values = ind_pal) +
scale_x_continuous(trans = scales::log_trans()) +
guides(fill = "none") +
labs(x = "Log of 2017 Total Industry Output (Millions of $)", y = "") +
theme_bw() +
axis_theme +
scale_x_continuous(breaks = c(0,6, 8, 10, 12, 14, 16), labels = c(0,6, 8, 10, 12, 14, 16))
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
input_sum
## Picking joint bandwidth of 0.388
input_sum +
geom_density_ridges(aes(x = log_value, y = naics_3digit_label, fill = naics_3digit_label), alpha = 0.8) +
geom_density_ridges(aes(x = log_ind, y = naics_3digit_label, fill = naics_3digit_label), alpha = 0.3, color = "white") +
scale_x_continuous(breaks = c(0,6, 8, 10, 12, 14, 16), labels = c(0,6, 8, 10, 12, 14, 16)) +
theme_dark()
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
## Picking joint bandwidth of 0.388
##
## Picking joint bandwidth of 0.398
##
## Picking joint bandwidth of 0.397
However, when examining the intensity of use of intermediary inputs (e.g. the ratio of intermediary inputs to output), the distribution of each 3 Digit NAICS Manufacturing sector changes substantially.
input_ratio <- manf_ind %>%
ggplot() +
geom_density_ridges(aes(x = input_ratio, y = naics_3digit_label, fill = naics_3digit_label)) +
scale_fill_manual(values = ind_pal) +
guides(fill = "none") +
labs(x = "2017 Ratio of Intermeidary Inputs to Industry Output", y = "") +
theme_bw() +
axis_theme
input_ratio
## Picking joint bandwidth of 0.0311
input_ratio +
geom_density_ridges(aes(x = val_ratio, y = naics_3digit_label, fill = naics_3digit_label), alpha = 0.7)
## Picking joint bandwidth of 0.0311
## Picking joint bandwidth of 0.0311
We conclude these summary statistics by examining the intensity of use of imports by manufacturing sub-sector.
import_sum <- manf_ind %>%
filter(imports != 0) %>%
ggplot() +
geom_density_ridges(aes(x = log_imports, y = naics_3digit_label, fill = naics_3digit_label)) +
scale_fill_manual(values = ind_pal) +
guides(fill = "none") +
labs(x = "2017 Log Imports", y = "") +
theme_bw() +
axis_theme
import_sum
## Picking joint bandwidth of 0.508
input_sum +
geom_density_ridges(aes(x = log_value, y = naics_3digit_label, fill = naics_3digit_label), alpha = 0.8) +
geom_density_ridges(aes(x = log_ind, y = naics_3digit_label, fill = naics_3digit_label), alpha = 0.3, color = "white") +
geom_density_ridges(data = manf_ind %>%
filter(imports != 0),
aes(x = log_imports, y = naics_3digit_label, fill = naics_3digit_label), color = "grey") +
scale_x_continuous(breaks = c(0,6, 8, 10, 12, 14, 16), labels = c(0,6, 8, 10, 12, 14, 16)) +
theme_dark()
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
## Picking joint bandwidth of 0.388
##
## Picking joint bandwidth of 0.398
##
## Picking joint bandwidth of 0.397
##
## Picking joint bandwidth of 0.508
import_rat <- manf_ind %>%
ggplot() +
geom_density_ridges(aes(x = import_ratio, y = naics_3digit_label, fill = naics_3digit_label)) +
scale_fill_manual(values = ind_pal) +
guides(fill = "none") +
labs(x = "2017 Ratio of Imports to Intermediate Inputs", y = "") +
theme_bw() +
axis_theme
import_rat
## Picking joint bandwidth of 0.00136
We see that the ratio of intermediary inputs to total output and the ratio of value added to total output, appear to almost mirror each other, supporting the observation that because of how input-output accounting is done, the ratio of intermediary inputs to total output also captures value added in the final production step.
We now create some commodity level summaries, detailing the total use of specific commodities as intermediary inputs, compared to final products. In addition, we include measures of how many of these commodities are exported,
com_agg <- use_table_clean %>%
filter(industry_code %in% c("T001", "F04000", "T019")) %>%
select(commodity_code, commodity_desc, industry_code, industry_desc, value_num) %>%
pivot_wider(id_cols = c(commodity_code, commodity_desc), names_from = industry_desc, values_from = value_num)
colnames(com_agg) <- c("commodity_code", "commodity_desc", "intermediate", "exports", "total_use")
This dataframe has COMMODITY level summaries. We can use commodity level summaries to control for the input-intensity of a given commodity. ****INSERT HERE: Commodities that have a higher share of intermediate use to total use, are
Then, we can look at the use of specific commodities by industries to understand which industries are using relatively input-intensive *****
We now have aggregate commodity and industry level summaries, and can simplify our original dataframe. We will also need to perform similar aggregation measures in the supply table (the make table)
At this point, we might also want to separate out our commodities into some broad categories. While this categorization might be too granular, it at least allows us some baseline to assess industry use of specific types of inputs. This categorization is based on the NAICS codes used in the SUPPLY/MAKE tables, and roughly varies at the 3-digit level. A few specific inputs, such as R&D, are called out.
use_table_final <- use_table_clean %>%
filter(!is.na(com_code_3), !is.na(naics_3digit)) %>%
mutate(input_type = case_when(
com_code_3 < 113 ~ "Agriculture and Farming",
com_code_3 == 113 | com_code_3 == 115 ~ "Forestry and Logging",
com_code_3 == 114 ~ "Fishing, Hunting, Trapping",
com_code_3 == 211 ~ "Oil and Gas Extraction",
com_code_3 == 212 ~ "Metal Mining",
com_code_3 == 213 ~ "Support for Oil, Gas, and Mining",
com_code_3 == 221 ~ "Electricity, Gas, Water, Sewage, and other systems",
com_code_3 >= 300 & com_code_3 < 400 ~ "Manufactured Input",
com_code_3 >= 400 & com_code_3 < 460 ~ "Other wholesalers, suppliers, retailers",
com_code_3 >= 481 & com_code_3 < 490 ~ "Transportation",
com_code_3 >= 491 & com_code_3 < 493 ~ "Postal Service and other messaging",
com_code_3 >= 500 & com_code_3 < 600 ~ "Software, Information, Financial, Legal, and other services",
com_code_3 >= 600 ~ "Colleges, Universities, Educational Services, Hospitals, Medical and Diagnostic Labs, and Other Health Care Services",
com_code_3 == 811 | com_code_3 == 230 ~ "Maintenance & Repair",
com_code_3 == 813 ~ "Grantmaking and Philanthropy",
TRUE ~ "Other inputs (e.g. arts & recreation; accomodation, etc.)"
)) %>%
mutate(input_type = case_when(
commodity_code == "423100" ~ "Motor vehicle and motor vehicle parts and supplies",
commodity_code == "423800" ~ "Machinery, Equipment, Supplies",
commodity_code == "424700" ~ "Petroleum and Petroleum Products",
commodity_code == "541700" ~ "Scientific research and development services",
TRUE ~ input_type
))
#Create dataframe for ordering for later
com_order <- use_table_final %>%
group_by(input_type) %>%
reframe(com_sum = sum(value_num, na.rm = TRUE)) %>%
arrange(input_type) %>%
mutate(alpha_order = seq(n())) %>%
arrange(desc(com_sum)) %>%
mutate(num_order = seq(n()))
We now start by creating some high level summary statistics by industry, focusing specifically on the manufacturing sector. Previous work has categorized 6-digit NAICS sectors into traded, or local (Delgado et al., 2014); or “supply-chain” industries (Delgado and Mills, 2020), but explicitly does not consider the supply chain of specific companies (e.g. GM), or specific industries. In this work, we seek to map out specific supply chains for specific industries, and examine how supply chains organize differently across regions. In the above section, we already provided summary statistics about the total use of inputs by industries. Now, we want to specifically focus on the type of inputs that manufacturers require.
manf_inputs <- use_table_final %>%
filter(naics_3digit >= 300 & naics_3digit < 400) %>%
select(-c(value, value_clean)) %>%
left_join(manf_ind) %>%
mutate(input_share = value_num / intermediary_inputs)
## Joining with `by = join_by(industry_desc, industry_code, naics_3digit,
## naics_3digit_label)`
In total, we have, ‘r manf_inputs %>% select(industry_desc) %>% distinct() %>% nrow()’ (232) different types of manufacturing, at the 6-digit level. To make sense of the variation in use of inputs by manufacturing sub-sector, we start with a 3-level by 3-level summary, acknowledging that there is substantial variation in use of inputs by the different manufacturing sector.
input_list <- manf_inputs %>%
select(input_type) %>%
distinct() %>%
unlist()
input_agg_share <- manf_inputs %>%
group_by(industry_code, industry_desc, naics_3digit_label, input_type) %>%
reframe(input_share = sum(input_share, na.rm = TRUE), intermediary_inputs, value = sum(value_num, na.rm = TRUE)) %>%
distinct()
manf_codes <- manf_inputs %>%
select(naics_3digit_label) %>%
distinct() %>%
unlist()
ExpandColorsLIGHT <- function(colors, n, steps = 11){
if(n <= steps){
suppressWarnings({
sapply(colors, function(x){colorRampPalette(c(x, "#FFFFFF"))(steps)}) %>%
as.data.frame() %>%
filter(row_number() <= n) %>%
gather(key = original.color, value = expanded.color)
})
}else{
warning("Select n < steps!")
}
}
ExpandColorsDARK <- function(colors, n, steps = 11){
if(n <= steps){
suppressWarnings({
sapply(colors, function(x){colorRampPalette(c(x, "#000000"))(steps)}) %>%
as.data.frame() %>%
filter(row_number() <= n) %>%
gather(key = original.color, value = expanded.color)
})
}else{
warning("Select n < steps!")
}
}
input_list %>% unname()
## [1] "Agriculture and Farming"
## [2] "Forestry and Logging"
## [3] "Fishing, Hunting, Trapping"
## [4] "Oil and Gas Extraction"
## [5] "Metal Mining"
## [6] "Electricity, Gas, Water, Sewage, and other systems"
## [7] "Maintenance & Repair"
## [8] "Manufactured Input"
## [9] "Motor vehicle and motor vehicle parts and supplies"
## [10] "Other wholesalers, suppliers, retailers"
## [11] "Machinery, Equipment, Supplies"
## [12] "Petroleum and Petroleum Products"
## [13] "Transportation"
## [14] "Postal Service and other messaging"
## [15] "Other inputs (e.g. arts & recreation; accomodation, etc.)"
## [16] "Software, Information, Financial, Legal, and other services"
## [17] "Scientific research and development services"
## [18] "Colleges, Universities, Educational Services, Hospitals, Medical and Diagnostic Labs, and Other Health Care Services"
input_cols <- c("#4DAF4A", "#D8CBB7", "#C7E1C6", "#7E5F37", "#252525", "#F3C38F" ,"#80B1D3", "#597B93", "#954D24", "#969696", "#386CB0","#E78AC3", "#B1865D", "#D9D9D9", "#BDBDBD", "#cf4633", "#FB8072" , "#703933")
input_col_map <- data.frame(input_type = input_list %>% unname(), input_cols) %>%
mutate(tal = seq(n()))
com_order <- left_join(input_col_map, com_order)
## Joining with `by = join_by(input_type)`
# Map Inputs to Colors
get_col_vec <- function(data){
data %>%
select(input_type) %>%
distinct() %>%
unlist()
}
#Function to get appropriate input colors
get_col_match <- function(data){
col_vec <- get_col_vec(data)
col_out <- input_col_map %>%
filter(input_type %in% col_vec) %>%
select(input_cols) %>%
unlist() %>%
unname()
return(col_out)
}
make_density <- function(number){
df <- input_agg_share %>%
filter(naics_3digit_label %in% manf_codes[number], !is.na(input_share)) %>%
mutate(val_round = round(input_share, 5)) %>%
filter(val_round > 0) %>%
group_by(input_type) %>%
mutate(count = n()) %>%
filter(count >= 3) %>%
ungroup() %>%
left_join(com_order)
col_vec_1 <- df %>%
get_col_match()
# col_vec_2 <- df %>%
# filter(count <3) %>%
# get_col_match()
df %>%
ggplot(aes(x = input_share, y = reorder(input_type, -num_order), fill = reorder(input_type, tal))) +
geom_density_ridges(alpha = 0.7, jittered_points = TRUE, point_alpha=1,point_shape=21) +
geom_text(data = df %>% select(input_type, num_order, tal) %>% distinct(), aes(x = 0.35, y = reorder(input_type, -num_order), label = input_type), color = "black", nudge_y = -.2) +
scale_fill_manual(values = col_vec_1, guide = "none") +
# ggnewscale::new_scale_fill() +
# geom_point(data = df %>% filter(count <3), aes(), shape = 21) +
# scale_fill_manual(values = col_vec_2, guide = "none") +
theme_bw() +
labs(x = "", y = "", title = manf_codes[number], fill = "")
}
make_density_simple <- function(number){
df <- input_agg_share %>%
filter(naics_3digit_label %in% manf_codes[number], !is.na(input_share)) %>%
# group_by(input_type) %>%
# reframe(inputs_total = sum(intermediary_inputs), individual_input = sum(value)) %>%
# mutate(input_share = individual_input / inputs_total) %>%
left_join(com_order)
col_vec_1 <- df %>%
get_col_match()
# col_vec_2 <- df %>%
# filter(count <3) %>%
# get_col_match()
df %>%
ggplot(aes(x = input_share, y = reorder(input_type, -num_order), fill = reorder(input_type, tal)), group = industry_desc) +
geom_col(alpha = 0.7, color = "black", position = "stack") +
geom_text(data = df %>% select(input_type, num_order, tal) %>% distinct(), aes(x = 0.35, y = reorder(input_type, -num_order), label = input_type), color = "black", nudge_y = -.2) +
scale_fill_manual(values = col_vec_1, guide = "none") +
guides(color = "none") +
# ggnewscale::new_scale_fill() +
# geom_point(data = df %>% filter(count <3), aes(), shape = 21) +
# scale_fill_manual(values = col_vec_2, guide = "none") +
theme_bw() +
labs(x = "", y = "", title = manf_codes[number], fill = "") +
theme(legend.position = "bottom")
}
manf_codes %>% unname()
## [1] "Food manufacturing"
## [2] "Chemical manufacturing"
## [3] "Beverage and tobacco product manufacturing"
## [4] "Miscellaneous manufacturing"
## [5] "Textile mills"
## [6] "Textile product mills"
## [7] "Wood product manufacturing"
## [8] "Nonmetallic mineral product manufacturing"
## [9] "Furniture and related product manufacturing"
## [10] "Apparel manufacturing"
## [11] "Leather and allied product manufacturing"
## [12] "Paper manufacturing"
## [13] "Printing and related support activities"
## [14] "Plastics and rubber products manufacturing"
## [15] "Primary metal manufacturing"
## [16] "Fabricated metal product manufacturing"
## [17] "Transportation equipment manufacturing"
## [18] "Petroleum and coal products manufacturing"
## [19] "Electrical equipment, appliance, and component manufacturing"
## [20] "Machinery manufacturing"
## [21] "Computer and electronic product manufacturing"
food_manf <- make_density(1)
## Joining with `by = join_by(input_type)`
chem_manf <- make_density(2)
## Joining with `by = join_by(input_type)`
bev_manf <- make_density(3)
## Joining with `by = join_by(input_type)`
misc_manf <- make_density(4)
## Joining with `by = join_by(input_type)`
text_manf <- make_density(5)
## Joining with `by = join_by(input_type)`
text_prod_manf <- make_density(6)
## Joining with `by = join_by(input_type)`
wood_manf <- make_density(7)
## Joining with `by = join_by(input_type)`
non_metal_min_manf <- make_density(8)
## Joining with `by = join_by(input_type)`
furn_manf <- make_density(9)
## Joining with `by = join_by(input_type)`
apparel_manf <- make_density_simple(10)
## Joining with `by = join_by(input_type)`
leather_manf <- make_density_simple(11)
## Joining with `by = join_by(input_type)`
paper_manf <- make_density(12)
## Joining with `by = join_by(input_type)`
printing <- make_density_simple(13)
## Joining with `by = join_by(input_type)`
plastics_manf <- make_density(14)
## Joining with `by = join_by(input_type)`
metal_manf <- make_density(15)
## Joining with `by = join_by(input_type)`
fab_metal_manf <- make_density(16)
## Joining with `by = join_by(input_type)`
transport_equipment_manf <- make_density(17)
## Joining with `by = join_by(input_type)`
petro_manf <- make_density(18)
## Joining with `by = join_by(input_type)`
ee_appliance_component_manf <- make_density(19)
## Joining with `by = join_by(input_type)`
machine_manf <- make_density(20)
## Joining with `by = join_by(input_type)`
comp_ee_manf <- make_density(21)
## Joining with `by = join_by(input_type)`
Across food, miscellaneous, and beverage manufacturing, the distribution of the use of manufactured inputs is fairly flat, but wide for food manufacturing, higher and shifted right for miscellaneous manufacturing, as well as beverage and tobacco manufacturing. Miscellaneous manufacturing and beverage and tobacco manufacturing both also have a higher ratio of software, financial, legal, and other information services. The use of agricultural/farming as well fishing/hunting inputs is also not surprising.
g1 <- food_manf + theme(axis.text.y = element_blank())
g2 <- misc_manf + theme(axis.text.y = element_blank())
g3 <- bev_manf + theme(axis.text.y = element_blank())
gA <- g1 + g2 + g3
gA
## Picking joint bandwidth of 0.0193
## Picking joint bandwidth of 0.00632
## Picking joint bandwidth of 0.0189
Across chemical, plastics & rubber, and petroleum and coal product manufacturing, both chemical as well as petroleum and coal product manufacturing use mined inputs. One of the subsectors in petroleum and coal product manufacturing uses fewer manufactured inputs. There is some use of warehousing in chemical manufacturing, and both chemical manufacturing as well as plastics and rubber product manufacturing have some sub-sectors with a high use of electricity, water, sewage, gas, and other services.
g4 <- chem_manf + theme(axis.text.y = element_blank())
g5 <- plastics_manf + theme(axis.text.y = element_blank())
g6 <- petro_manf + theme(axis.text.y = element_blank())
gB <- g4 + g5 + g6
gB
## Picking joint bandwidth of 0.00819
## Picking joint bandwidth of 0.00554
## Picking joint bandwidth of 0.023
Across paper, printing, and leather manufacturing, the use of manufactured inputs, as well as electricity, gas, sewage, water and other services, is rather high, along with the use of software and other financial services. Unsurprisingly, paper manufacturing has high use of forestry and logging. Printing and leather manufacturing only have two and one sub-industry, respectively.
g7 <- paper_manf + theme(axis.text.y = element_blank())
g8 <- printing + theme(axis.text.y = element_blank())
g9 <- leather_manf + theme(axis.text.y = element_blank())
gC <- g7 + g8 + g9
gC
## Picking joint bandwidth of 0.0136
Wood product manufacturing consumes forestry and logging inputs, and not all sub-sectors have a high useage of manufactured inputs, while furniture manufacturing has a high use of manufactured inputs. There is only one sub-sector in apparel manufacturing.
g10 <- wood_manf + theme(axis.text.y = element_blank())
g11 <- furn_manf + theme(axis.text.y = element_blank())
g12 <- apparel_manf + theme(axis.text.y = element_blank())
gD <- g10 + g11 + g12
gD
## Picking joint bandwidth of 0.0156
## Picking joint bandwidth of 0.00513
Across textile mills, textile product mills, and machinery manufacturing, the use of manufactured inputs is high.
g13 <- text_manf + theme(axis.text.y = element_blank())
g14 <- text_prod_manf + theme(axis.text.y = element_blank())
g15 <- machine_manf + theme(axis.text.y = element_blank())
gE <- g13 + g14 + g15
gE
## Picking joint bandwidth of 0.0131
## Picking joint bandwidth of 0.00929
## Picking joint bandwidth of 0.00621
Across fab-metal manufacturing, metal manufacturing, and non-metallic minerals manufacturing, the use of metal mining is high in the later two, the use of manufactured inputs is high in fabricated metal manufacturing and more distributed in the other two, and both of the later two also have some subsectors that use eletricity, gas, water, sewage, and other services.
g16 <- fab_metal_manf + theme(axis.text.y = element_blank())
g17 <- metal_manf + theme(axis.text.y = element_blank())
g18 <- non_metal_min_manf + theme(axis.text.y = element_blank())
gF <- g16 + g17 + g18
gF
## Picking joint bandwidth of 0.00436
## Picking joint bandwidth of 0.0178
## Picking joint bandwidth of 0.0174
Across electricial equipment; computer and electonic, and transportation equipment manufacturing, we again see heterogeneity in the use of software type services, as well as manufactured inputs. Electrical equipment as well as computer and electronic manufacturing both use more wholesalers.
g19 <- ee_appliance_component_manf + theme(axis.text.y = element_blank())
g20 <- comp_ee_manf + theme(axis.text.y = element_blank())
g21 <- transport_equipment_manf + theme(axis.text.y = element_blank())
gG <- g19 + g20 + g21
gG
## Picking joint bandwidth of 0.00573
## Picking joint bandwidth of 0.00819
## Picking joint bandwidth of 0.00544
Overall, we can make some broad observations above the distribution of the use of inputs by 6-digit manufacturing sub-sectors. There is wide variation in the use of software, information, financial, legal, and other such services across sub-sectors. Overall, as 3-digit level industries use specific inputs, the heterogeneity of use of specific types of inputs also appears to increase as well.
input_sum_all <- input_agg_share %>%
filter(!is.na(input_share)) %>%
mutate(val_round = round(input_share, 5)) %>%
filter(val_round > 0) %>%
group_by(input_type) %>%
mutate(count = n()) %>%
filter(count >= 3) %>%
ungroup() %>%
left_join(com_order) %>%
ggplot(aes(x = input_share, y = reorder(input_type, -num_order), fill = naics_3digit_label, color = reorder(input_type, tal))) +
geom_density_ridges(alpha = 0.7, jittered_points = TRUE, point_alpha=1,point_shape=21) +
geom_point(aes(), shape = 21) +
scale_fill_manual(values = emp_ind_vector, guide = "none") +
theme_bw() +
labs(x = "", y = "", title = "", fill = "") +
guides(color = "none") +
scale_color_manual(values = input_cols, guide = "none") +
axis_theme
## Joining with `by = join_by(input_type)`
input_sum_all
## Picking joint bandwidth of 0.0122
While we acknowledge that there is substantial heterogeneity in the use of specific inputs across industries, it is nonetheless helpful to aggregate to the 3-digit level to describe certain archetypes of industries (e.g. those that do/do not use raw inputs).
We see that some sectors use more software and related services (e.g. leather and allied product manuafcturing, apparel manufacturing) on average, and that most sectors, with the exception of petroleum and coal product manufacturing, use a large amount of manufactured inputs. Further breaking down the types of manufactured inputs used will be very important.
Only a few industries (chemical, food, textile), use agricultural products, and a different set of industries (plastics, wood, paper), use lumber products. Only some industries use mined metals.
input_sum_3digit <- input_agg_share %>%
group_by(naics_3digit_label, input_type) %>%
reframe(total_input = sum(intermediary_inputs), int_input = sum(value), mean_share = mean(input_share), sd_share = sd(input_share), min_share = min(input_share), max_share = max(input_share)) %>%
mutate(input_share = int_input/total_input) %>%
left_join(com_order) %>%
ggplot() +
geom_col(aes(x = input_share, y = reorder(naics_3digit_label, num_order), fill = reorder(input_type, tal)), color = "black", alpha = 0.8) +
scale_fill_manual(values = input_cols) +
guides(fill = "none") +
theme_bw() +
labs(x = "Share of total industry inputs by commodity type", y = "NAICS 3-digit Manufacturing Sub-sector")
## Joining with `by = join_by(input_type)`
ggplotly(input_sum_3digit)
input_agg_share %>%
group_by(naics_3digit_label, input_type) %>%
reframe(total_input = sum(intermediary_inputs), int_input = sum(value), mean_share = mean(input_share), sd_share = sd(input_share), min_share = min(input_share), max_share = max(input_share)) %>%
mutate(input_share = int_input/total_input) %>%
left_join(com_order) %>%
ggplot() +
geom_col(aes(x = mean_share, y = reorder(naics_3digit_label, num_order), fill = reorder(input_type, tal)), color = "black", alpha = 0.8) +
scale_fill_manual(values = input_cols) +
guides(fill = "none") +
theme_bw() +
labs(x = "Mean Share of total industry inputs by commodity type", y = "NAICS 3-digit Manufacturing Sub-sector", title = "Mean Share across manufacturing sub-sectors")
At this point, we are interested in the similarity of inputs across and among manufacturing industries. To understand this similarity, we turn to clustering algorithms to group 6-digit manufacturing NAICS sectors into fewer groups based on similarity between inputs.
Let \(X_i\) be the \(M\) dimensional vector of inputs that are used by industry \(i\). There are then multiple measures of the similarity between industry \(i\) and any other industry \(j\), including the L1 norm, the L2 norm, and the correlation distance.
To prepare our data to understand these values, we have to transform the shape first, so that each industry is a row, and the column is identifier of the commodity used by that industry. Here, we want to maintain maxiumum granularity of both commodities as well as industries. As such, we return to the manf_inputs dataframe and create two sparse matricies: one for the total value of inputs used, and one for the share of inputs as a percentage of the total intermediary inputs used by an industry.
set.seed(37)
input_cluster_raw <- manf_inputs %>%
select(commodity_code, commodity_desc, industry_desc, industry_code, value_num, input_share) %>%
pivot_wider(id_cols = c(industry_desc, industry_code), values_from = value_num, names_from = commodity_code, values_fill = 0)
input_cluster_share <- manf_inputs %>%
select(commodity_code, commodity_desc, industry_desc, industry_code, value_num, input_share) %>%
pivot_wider(id_cols = c(industry_desc, industry_code), values_from = input_share, names_from = commodity_code, values_fill = 0)
input_cluster_share$ind_code_num <- as.numeric(sub("A.*", "", input_cluster_share$industry_code))
## Warning: NAs introduced by coercion
We create a 232x326 matrix of 232 manufacturing industries (i), and their use of 326 commodities (c).
input_raw_matrix <- input_cluster_share %>%
select(-c(industry_code, industry_desc, ind_code_num)) %>%
as.matrix()
input_cluster_matrix <- input_cluster_raw %>%
select(-c(industry_code, industry_desc)) %>%
as.matrix()
Using this matrix, we can create our first, initial clustering, using k-means clustering, and benchmarking against the number of 3-digit manufacturing codes: 21.
k_means_1 <- kmeans(x=input_cluster_matrix, centers=21)
However, we might want to use hierarchical clustering instead of k-means clustering. In this case, we use the default measure of average distance, where the distance between two clusters is defined as the average distance between any members of the two clusters.
We first plot a clustering output using euclidean distances between vectors of inputs.
h_clust_1 <- hclust(d=dist(input_cluster_matrix))
plot(h_clust_1)
We then consider how using correlation distance changes the results of the clustering. Moving forward, we will primary focus on the second distance measure.
h_clust_raw <- hclust(d = as.dist(1-cor(t(input_raw_matrix))))
h_clust_2 <- hclust(d = as.dist(1-cor(t(input_cluster_matrix))))
plot(h_clust_2, xlab = "Correlation Distance Hierarchal Clustering")
plot(h_clust_2, xlab = "Correlation Distance Hierarchal Clustering")
rect.hclust(h_clust_2, k=21, border="red")
plot(h_clust_raw)
rect.hclust(h_clust_raw, k=21, border="red")
Thus, we see how manufacturing sub-sectors are placed into clusters based on their use of inputs. We now compare these clustering results to the 3-digit NAICS code that the various manufacturing sub-sectors belong to.
h_clust <- h_clust_2 %>% cutree(k = 21)
k_clust <- k_means_1$cluster
cluster_assignments <- data.frame(
ind_desc = input_cluster_share$industry_desc,
ind_code = input_cluster_share$industry_code,
h_clusters = h_clust,
k_clusters = k_clust
)
Comparing how 6-digit NAICS manufacturing sectors are grouped together at the 3-digit NAICS level with a new clustering assignment that groups manufacturing sectors together based on shared inputs, we see that sub-sectors that were formerly grouped together at the 3-digit NAICS level are no longer paired together.
This is only the first iteration of this approach, and we see evidence that certain clusters may need to be further disaggregated (e.g. cluster 5, 9, 14, 17, and 20). However, groupings from this first pass are certainly intriguing.
h_clust <- cluster_assignments %>%
mutate(naics_3digit = as.numeric(str_sub(ind_code, 1, 3))) %>%
left_join(ind_3digit) %>%
ggplot() +
geom_col(aes(x = h_clusters, y = naics_3digit, fill = naics_3digit_label, label = ind_desc), position = "stack") +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "Hierarchical Clustering Assignment", y = "3-Digit NAICS Code") +
scale_x_continuous(breaks = seq(1, 21)) +
theme_bw() +
theme(axis.text.y = element_blank()) +
axis_theme
## Joining with `by = join_by(naics_3digit)`
## Warning in geom_col(aes(x = h_clusters, y = naics_3digit, fill =
## naics_3digit_label, : Ignoring unknown aesthetics: label
h_clust
We also see further evidence that the hierarchical clustering approach
is superior to the k-means clustering approach.
k_clust <- cluster_assignments %>%
mutate(naics_3digit = as.numeric(str_sub(ind_code, 1, 3))) %>%
left_join(ind_3digit) %>%
ggplot() +
geom_col(aes(x = k_clusters, y = naics_3digit, fill = naics_3digit_label)) +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "K-Means Clustering Assignment", y = "3-Digit NAICS Code") +
scale_x_continuous(breaks = seq(1, 21)) +
theme_bw() +
theme(axis.text.y = element_blank()) +
axis_theme +
axis_theme
## Joining with `by = join_by(naics_3digit)`
k_clust
We can spend some time investigating how 6-digit manufacturing sectors are assigned to new clusters.
ggplotly(h_clust)
We see that different types of food manufacturing (pink), are separated out. It might make sense to group these together. There is a distinct food chemical manufacturing cluster (4), and a plastics and petroleum cluster (5). Curiously, specific types of niche chemical manufacturing (fertilizer and industrial gas), are grouped with other non-metalli mineral porduct manufacturing, as well as paper manufacturing and printing (cluster 6).
There is a curious cluster that appears to be related to “retail goods” at cluster 9, which seems to combine some industrial electrical equipment manufacturing with industrial process, packaging machinery, and a variety of miscellaneous manufacturing. This cluster is a prime candidate for further review/re-clustering.
Cluster 14 is a large industrial cluster, and seems to largely cover component manufacturing. Cluster 16 is a seperate, seemingly more defense releated, machinery, metals, and component manufacturing cluster.
Cluster 17 is a clear transportation equipment cluster, although certain machinery such as lawn and garden equipment appear in this cluster.
Cluster 20 is then a aerospace and computer electronics cluster, while cluster 21 is an aircraft manufacturing cluster.
There is clearly need to alter the clustering algorithm to prune slightly differently, which will be explored later. We can get some additional information about heterogeneity in clusters by varying the total number of clusters.
h_15 <- cluster_assignments %>%
mutate(h_clust_15 = cutree(h_clust_2, k = 15)) %>%
mutate(naics_3digit = as.numeric(str_sub(ind_code, 1, 3))) %>%
left_join(ind_3digit) %>%
ggplot() +
geom_col(aes(x = h_clust_15, y = naics_3digit, fill = naics_3digit_label, label = ind_desc), position = "stack") +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "Hierarchical Clustering Assignment (15)", y = "3-Digit NAICS Code") +
theme_bw() +
theme(axis.text.y = element_blank()) +
axis_theme
## Joining with `by = join_by(naics_3digit)`
## Warning in geom_col(aes(x = h_clust_15, y = naics_3digit, fill =
## naics_3digit_label, : Ignoring unknown aesthetics: label
h_30 <- cluster_assignments %>%
mutate(h_clust_30 = cutree(h_clust_2, k = 30)) %>%
mutate(naics_3digit = as.numeric(str_sub(ind_code, 1, 3))) %>%
left_join(ind_3digit) %>%
ggplot() +
geom_col(aes(x = h_clust_30, y = naics_3digit, fill = naics_3digit_label, label = ind_desc), position = "stack") +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "Hierarchical Clustering Assignment (30)", y = "3-Digit NAICS Code") +
theme_bw() +
theme(axis.text.y = element_blank()) +
axis_theme
## Joining with `by = join_by(naics_3digit)`
## Warning in geom_col(aes(x = h_clust_30, y = naics_3digit, fill =
## naics_3digit_label, : Ignoring unknown aesthetics: label
h_range <- h_15 + h_30
h_range
We might also want to benchmark how clusters perform when using the normalized input_share measure, as opposed to raw inputs.
cluster_raw <- cluster_assignments %>%
mutate(cluster_raw_21 = cutree(h_clust_raw, k = 21)) %>%
mutate(naics_3digit = as.numeric(str_sub(ind_code, 1, 3))) %>%
left_join(ind_3digit) %>%
ggplot() +
geom_col(aes(x = cluster_raw_21, y = naics_3digit, fill = naics_3digit_label, label = ind_desc), position = "stack") +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "Hierarchical Clustering Assignment (Raw Values)", y = "3-Digit NAICS Code") +
theme_bw() +
theme(axis.text.y = element_blank()) +
axis_theme
## Joining with `by = join_by(naics_3digit)`
## Warning in geom_col(aes(x = cluster_raw_21, y = naics_3digit, fill =
## naics_3digit_label, : Ignoring unknown aesthetics: label
cluster_raw
We see, perhaps unsurprisngly, that the clustering assignment are robust
to raw v. scaled values.
We can continue to validate this approach by comparing the distribution of use of inputs across the new clusters, as we did for NAICS 3-digit codes. We see that while the overall distribution of the use of inputs by 6-digit sub-sectors at the 3-digit level is tighter, this distributions are still sometimes large, especially for manufactured inputs. As such, there might be some value in just clustering over the use of manufactured inputs.
ExpandColorsSHIFT <- function(colors, n, steps = 11){
if(n <= steps){
suppressWarnings({
sapply(colors, function(x){colorRampPalette(c(x, "#66C2A5"))(steps)}) %>%
as.data.frame() %>%
filter(row_number() <= n) %>%
gather(key = original.color, value = expanded.color)
})
}else{
warning("Select n < steps!")
}
}
h_cols <- ExpandColorsSHIFT(emp_ind_vector, 4)
h_cols <- h_cols %>%
group_by(original.color) %>%
mutate(tal = seq(n())) %>%
filter(tal == 4) %>%
ungroup() %>%
select(expanded.color) %>%
unlist() %>%
unname()
new_clusters_input <- input_agg_share %>%
rename(ind_desc = industry_desc, ind_code = industry_code) %>%
filter(!is.na(input_share)) %>%
mutate(val_round = round(input_share, 5)) %>%
filter(val_round > 0) %>%
group_by(input_type) %>%
mutate(count = n()) %>%
filter(count >= 3) %>%
left_join(cluster_assignments) %>%
ungroup() %>%
left_join(com_order) %>%
ggplot(aes(x = input_share, y = reorder(input_type, -num_order), fill = as.factor(h_clusters), color = reorder(input_type, tal))) +
geom_density_ridges(alpha = 0.7, jittered_points = TRUE, point_alpha=1,point_shape=21) +
scale_fill_manual(values = h_cols, guide = "none") +
theme_bw() +
labs(x = "", y = "", title = "", fill = "") +
guides(color = "none") +
scale_color_manual(values = input_cols, guide = "none") +
axis_theme
## Joining with `by = join_by(ind_code, ind_desc)`
## Joining with `by = join_by(input_type)`
new_clusters_input
## Picking joint bandwidth of 0.0158
Aggregating across our input clusters, we see that we might have to do some correcting for shares. However, the distribution of use of inputs across the new clusters are substantially different than the use of inputs at the NAICS 3-digit level.
h_cluster_all <- input_agg_share %>%
rename(ind_desc = industry_desc, ind_code = industry_code) %>%
left_join(cluster_assignments) %>%
group_by(h_clusters, input_type) %>%
reframe(total_input = sum(intermediary_inputs), int_input = sum(value), mean_share = mean(input_share), sd_share = sd(input_share), min_share = min(input_share), max_share = max(input_share)) %>%
mutate(input_share = int_input/total_input) %>%
left_join(com_order) %>%
ggplot() +
geom_col(aes(x = input_share, y = reorder(h_clusters, num_order), fill = reorder(input_type, tal)), color = "black", alpha = 0.8) +
scale_fill_manual(values = input_cols) +
guides(fill = "none") +
theme_bw() +
labs(x = "Share of total industry inputs by commodity type", y = "H-Clustering by Inputs") +
axis_theme
## Joining with `by = join_by(ind_code, ind_desc)`
## Joining with `by = join_by(input_type)`
h_cluster_all
Our next step will be to refine this clustering algorithm, introduce some pruning, and label individual clusters based on the NAICS 6-digit subs-sectors and inputs they contain. Ideally, we will also name clusters based on the outputs that they produce.
Below, we compare the use of inputs across our newly defined clusters to the NAICS 3-digit level clustesr.
h_cluster_all + input_sum_3digit
We will return to this clustering algorithm, ideally to adjust the pruning and clustering, later.
At this point, to complete our picture of the use of various inputs, we might want to see how other industries use manufactured inputs.
manf_use <- use_table_final %>%
filter(com_code_3 >= 300 & com_code_3 < 400) %>%
filter(is.na(naics_3digit_label)) %>%
group_by(naics_3digit, commodity_code, commodity_desc, com_code_3) %>%
reframe(value = sum(value_num))
We have to return to our aggregate dataframe to create a benchmark against total outputs and use of inputs at the NAICS 3-digit level.
ind_3level_sum <- ind_agg %>%
mutate(naics_3digit = as.numeric(str_sub(industry_code, 1, 3))) %>%
filter(naics_3digit < 300 | naics_3digit >= 400) %>%
group_by(naics_3digit) %>%
reframe(total_output = sum(industry_output), total_inputs = sum(intermediary_inputs))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `naics_3digit = as.numeric(str_sub(industry_code, 1, 3))`.
## Caused by warning:
## ! NAs introduced by coercion
manf_use_final <- manf_use %>%
left_join(ind_3level_sum) %>%
mutate(manf_ratio = value/total_inputs)
## Joining with `by = join_by(naics_3digit)`
We see that certain non-manufacturing sectors use more than 10% of their total inputs as manufactured inputs. As such, we might need to consider these industries later, to identify when manufacturers are co-locating with end-use customers.
manf_input_sum <- manf_use_final %>%
left_join(ind_3digit %>%
rename(com_code_3 = naics_3digit,
manf_desc = naics_3digit_label)) %>%
ggplot() +
geom_col(aes(x = manf_ratio, y = as.factor(naics_3digit), fill = manf_desc)) +
scale_fill_manual(values = emp_ind_vector, guide = "none") +
labs( x = "Ratio of Manufactured Inputs to Total Intermediary Inputs Used", y = "3 Digit NAICS Code") +
theme_bw() +
axis_theme
## Joining with `by = join_by(com_code_3)`
ggplotly(manf_input_sum)
We now turn to the MAKE table.
supply_table_raw <- read.csv(here("Input Output Data/SUPPLY_TABLE_2017/2017-Table 1.csv"))
We create a crosswalk between industry codes and their descriptions.
supply_cols <- colnames(supply_table_raw)
sup_ind_codes <- supply_table_raw %>%
head(1) %>%
pivot_longer(-c(1,2), names_to = "industry_desc", values_to = "industry_code") %>%
select(industry_code, industry_desc)
The first step will be to convert this data into a long data format.
supply_table_long <- supply_table_raw[-1,] %>%
pivot_longer(-c(1,2), names_to = "industry_desc", values_to = "value", values_drop_na = TRUE) %>%
mutate(value_clean = str_remove_all(value, "[[:punct:]]"),
value_num = as.numeric(value_clean)) %>%
filter(!is.na(value_num)) %>%
rename(commodity_code = 1, commodity_desc = 2) %>%
left_join(sup_ind_codes)
## Joining with `by = join_by(industry_desc)`
The resulting dataframe details how individuals industries MAKE specific commodities. E.G the SUPPLY for specific OUTPUTS from industries. Total output from industry are given by the “diagonal” entry (i.e commodity code = industry code).
Again, we add a few additional descriptive columns to be able to segment and sort our data. Values in this data are then the total amount in millions of dollars of specific outputs produced by specific industries.
supply_table_clean <- supply_table_long %>%
mutate(com_code_3 = as.numeric(str_sub(commodity_code, 1, 3)),
naics_3digit = as.numeric(str_sub(industry_code, 1, 3))) %>%
left_join(ind_3digit)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `com_code_3 = as.numeric(str_sub(commodity_code, 1, 3))`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
## Joining with `by = join_by(naics_3digit)`
As above, we note that some of the commodity codes refer to aggregate measures. The same is true for some of the industry codes.
sup_agg_coms <- supply_table_clean %>% select(commodity_code, commodity_desc, com_code_3) %>% distinct() %>% filter(is.na(com_code_3))
sup_agg_coms <- sup_agg_coms[-c(1:2),]
sup_agg_coms %>% select(1:2)
## # A tibble: 13 × 2
## commodity_code commodity_desc
## <chr> <chr>
## 1 52A000 Monetary authorities and depository credit intermediation
## 2 S00500 Federal general government (defense)
## 3 S00600 Federal general government (nondefense)
## 4 S00102 Other federal government enterprises
## 5 GSLGE State and local government (educational services)
## 6 GSLGH State and local government (hospitals and health services)
## 7 GSLGO State and local government (other services)
## 8 S00203 Other state and local government enterprises
## 9 S00401 Scrap
## 10 S00402 Used and secondhand goods
## 11 S00300 Noncomparable imports
## 12 S00900 Rest of the world adjustment
## 13 T017 Total industry supply
These codes are useful in informing aggregate measures, such as total industry supply. As before, some aggregation over industry codes is necessary as well. Here, aggregation includes the measures of imports per commodity, as well as total commodity output.
sup_agg_ind <- supply_table_clean %>% select(industry_code, industry_desc, naics_3digit) %>% distinct() %>% filter(is.na(naics_3digit))
Because of the way that these databases are constructred, the sup_agg_ind dataframe above refers to commodity level summaries, while the sup_agg_com dataframe above refers to industry level summaries. We spend some time at the moment to create some commodity and industry level benchmarks.
sup_agg_coms %>%
filter(commodity_code %in% c("T017")) %>%
select(1:2)
## # A tibble: 1 × 2
## commodity_code commodity_desc
## <chr> <chr>
## 1 T017 Total industry supply
We begin with creating benchmarks for total supply by industry.
sup_ind_agg <- supply_table_clean %>%
filter(commodity_code %in% c("T017")) %>%
select(commodity_code, commodity_desc, industry_code, industry_desc, value_num) %>%
pivot_wider(id_cols = c(industry_code, industry_desc), names_from = commodity_code, values_from = value_num)
colnames(sup_ind_agg) <- c("industry_code", "industry_desc", "total_industry_supply")
We now create some commodity level summaries, detailing the total commodity output, as well as total commodity imports
sup_com_agg <- supply_table_clean %>%
filter(industry_code %in% c("T007", "MCIF")) %>%
select(commodity_code, commodity_desc, industry_code, industry_desc, value_num) %>%
pivot_wider(id_cols = c(commodity_code, commodity_desc), names_from = industry_desc, values_from = value_num)
colnames(sup_com_agg) <- c("commodity_code", "commodity_desc", "total_commodity_output", "total_commodity_imports")
We note that the very few of the top commodities are manufactured outputs. The predominant commodities are housing, hospitals, state and local government spending, R&D services, federal defense spending, insurance, management services, public housing, physicians offices, petroleum refiners, and electric power generation and transmission. The first strictly manufactured output is not until row 40, with Light truck and utility vehicle manufacturing. Certain sectors such as: Electric power generation, transmission, and distribution; Machinery, equipment and supplies; Architectural, engineering, and related services; might be related to manufacturing, but are not generally produced by manufacturing industries (as defined by NAICS codes). As such, we must remember that any final analysis will be extremely sensitive to the exact specifications of NAICS and commodity codes.
At this point, it is useful to remind ourselves of our higher level goals. From the make table, we seek to create two measures: 1) a measure of common outputs among manufacturing firms, and 2) a measure of the which manufacturing sectors produce inputs used by other sectors (both manufacturing, and non-manufacturing).
The second part of this puzzle will be more difficult to solve. As such we, start by answering two questions. 1) Which manufacturing sectors produce similar outputs? and 2) Do manufacturing sectors that use similar inputs produce similar outputs?
In this case, to normalize output of manufacturing sectors, we use our commodity level aggregation. For each commodity, we consider what percentage of total output of that commodity is produced by a given manufacturing sector.
supply_table_final <- supply_table_clean %>%
filter(!is.na(com_code_3), !is.na(naics_3digit)) %>%
select(-c(value, value_clean))
Here, we bring in earlier segmented aggregate data about total industry supply to benchmark industry production of specific outputs against their total supply of outputs.
manf_supply <- supply_table_final %>%
filter(naics_3digit >= 300 & naics_3digit < 400) %>%
left_join(sup_com_agg) %>%
mutate(manf_percent = value_num / total_commodity_output) %>% #percentage of a given output that is manufactured by a given manufacturing industry
left_join(sup_ind_agg) %>%
mutate(supply_percent = value_num / total_industry_supply) %>% #percentage of total industry supply that a given output makes up
group_by(industry_code) %>%
mutate(prod_num = n(), #Number of total products an industry produces
main_prod = max(supply_percent, na.rm = TRUE)) #Across all commodities an industry produces, the maximum of the percentage of total industry supply a given output makes up
## Joining with `by = join_by(commodity_code, commodity_desc)`
## Joining with `by = join_by(industry_desc, industry_code)`
At this point we note that there are some non-manufacturing industries that produce manufactured outputs.
non_manf_manf <- supply_table_final %>%
filter(com_code_3 >= 300 & com_code_3 < 400, value_num > 0, naics_3digit < 300 | naics_3digit >= 400)
However, these industries produce a very small percentage of the total manufactured commodity output. For example, for petroleum refineries, the oil and gas extraction NAICS code (211000) produces less than 6% of what the main manufacturing sector (Petroleum refineries: 324110) produces. As such, we do not worry about them for the moment
manf_supply$com_code_num <- as.numeric(sub("A.*", "", manf_supply$commodity_code))
## Warning: NAs introduced by coercion
manf_supply_graph <- manf_supply %>%
select(industry_code, industry_desc, main_prod) %>%
distinct()
Across our 232 manufacturing sub-sectors, we see that the vast majority of our manufacturing industries produce predominately 1 product
hist(manf_supply_graph$main_prod, xlab = "Main Product Percent of NAICS-6-Digit Output", main = "Main Product Percent of NAICS-6-Digit Output, 2019")
Only 83 sub-sectors produce less than 90% of the same output, and of these, only 30 sub-sectors produce less than 80% of the same output
manf_other <- manf_supply_graph %>%
filter(main_prod < .8) %>%
mutate(naics_3digit = as.numeric(str_sub(industry_code, 1, 3))) %>%
left_join(ind_3digit)
## Joining with `by = join_by(naics_3digit)`
manf_other %>%
ungroup() %>%
group_by(naics_3digit_label) %>%
count() %>%
arrange(desc(n))
## # A tibble: 12 × 2
## # Groups: naics_3digit_label [12]
## naics_3digit_label n
## <chr> <int>
## 1 Computer and electronic product manufacturing 6
## 2 Chemical manufacturing 5
## 3 Transportation equipment manufacturing 4
## 4 Food manufacturing 3
## 5 Machinery manufacturing 2
## 6 Miscellaneous manufacturing 2
## 7 Primary metal manufacturing 2
## 8 Textile mills 2
## 9 Fabricated metal product manufacturing 1
## 10 Leather and allied product manufacturing 1
## 11 Paper manufacturing 1
## 12 Printing and related support activities 1
We see that these sub-sectors are spread across a vareity of NAICS-3digit level sectors, and that computer and electronic product manufacturing as well as chemical manufacturing have the most sub-sectors with multiple outputs.
These sub-sectors include some interesting sectors, such as semiconductor and related device manufacturing, petrochemical manufacturing, pharmaceutical preparation manufacturing, and others.
manf_other %>%
ungroup() %>%
filter(str_detect(naics_3digit_label, "Chemical|Computer")) %>%
select(industry_desc, main_prod) %>%
arrange(desc(main_prod))
## # A tibble: 11 × 2
## industry_desc main_prod
## <chr> <dbl>
## 1 Pharmaceutical.preparation.manufacturing 0.794
## 2 Biological.product..except.diagnostic..manufacturing 0.790
## 3 Semiconductor.and.related.device.manufacturing 0.784
## 4 Other.communications.equipment.manufacturing 0.762
## 5 Computer.storage.device.manufacturing 0.756
## 6 In.vitro.diagnostic.substance.manufacturing 0.748
## 7 Petrochemical.manufacturing 0.730
## 8 Computer.terminals.and.other.computer.peripheral.equipment.manufac… 0.721
## 9 Medicinal.and.botanical.manufacturing 0.691
## 10 Telephone.apparatus.manufacturing 0.661
## 11 Electronic.computer.manufacturing 0.521
However, we also see that even among industries that produce multiple outputs, the lowest is fiber yarn and thread mills, and even for them, 48% of total industry supply comes from Synthetic rubber and artificial and synthetic fibers and filaments manufacturing, and another 44% comes from Fiber, yarn, and thread mills. Returning to our initial data, we see that across industries with multiple outputs, more than 80% of total industry supply is accounted for in the top two outputs. As another example, electronic computer manufacturing has 52% of its industry supply in scientific R&D and 43% of its industry supply in electronic computer manufacturing.
Thus, the top 1 or 2 outputs for an industry define the majority of that industries output. However, we might not want to completely discard the vector of other outputs that an industry produces. For one, many times this vector includes scientific R&D, which we may be interested in isolating to understand the distribution of R&D activity across sectors. In addition, the distribution of activity across other, non-primary outputs might reveal additional information that could be valuable.
To account for this, we explicitly consider highlighting scientific R&D and isolating this output. In addition, we cluster across vectors of non-primary commodities and group sectors together based on shared non-primary outputs. In addition, defining this approach now might prove useful in the future, when we are searching for non-manufacturing industries that are non-primary producers of inputs used in manufacturing industries.
Here, we create two separate dataframes. The first indexes the intensity of scientific R&D per manufacturing sector.
manf_rd <- manf_supply %>%
filter(commodity_desc == "Scientific research and development services", value_num > 0) %>%
select(industry_code, industry_desc, value_num, naics_3digit_label, total_commodity_output, total_industry_supply) %>%
mutate(rd_sup_perc = value_num / total_industry_supply, #percent of total industry supply that is scientific R&D
rd_com_perc = value_num / total_commodity_output) #percent of total R&D that comes from a specific industry
We see that certain subsectors are for more R&D intensive than others, and that certain NAICS-3digit sectors overall are more R&D intensive than others.
rd_graph <- manf_rd %>%
group_by(naics_3digit_label) %>%
mutate(max_rd = max(rd_sup_perc)) %>%
ggplot(aes(x = rd_sup_perc, y = reorder(naics_3digit_label, max_rd), fill = naics_3digit_label, label = industry_desc)) +
geom_density_ridges(aes(), alpha = 0.7, jittered_points = TRUE, point_alpha=1,point_shape=21) +
geom_point(aes(), shape = 21) +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "R&D Percent of Industry Total Supply", y = "NAICS 3-digit Sector") +
theme_bw() +
axis_theme
rd_graph
## Picking joint bandwidth of 0.00605
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
rd_graph1 <- manf_rd %>%
group_by(naics_3digit_label) %>%
mutate(max_rd = max(rd_sup_perc)) %>%
ggplot(aes(x = rd_sup_perc, y = reorder(naics_3digit_label, max_rd), fill = naics_3digit_label, label = industry_desc)) +
geom_point(aes(), shape = 21) +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "R&D Percent of Industry Total Supply", y = "NAICS 3-digit Sector") +
theme_bw() +
axis_theme
ggplotly(rd_graph1)
However, each manufacturing industry contributes very little to total R&D supply. Again, this is not surprising. Because of the way that NAICS codes work, we expect that establishments with R&D output will be classified in the 500’s, rather than in manufacturing.
Exploring both graphs carefully, we see that the industries that contribute more to the total supply of R&D are NOT always the same as the industries that have a higher share of total industry supply.
rd_com <- manf_rd %>%
group_by(naics_3digit_label) %>%
mutate(max_rd = max(rd_com_perc)) %>%
ggplot(aes(x = rd_com_perc, y = reorder(naics_3digit_label, max_rd), fill = naics_3digit_label, label = industry_desc)) +
geom_density_ridges(aes(), alpha = 0.7, jittered_points = TRUE, point_alpha=1,point_shape=21) +
geom_point(aes(), shape = 21) +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "Industry Percent of Total R&D Supply", y = "NAICS 3-digit Sector") +
theme_bw() +
axis_theme
rd_com
## Picking joint bandwidth of 0.000192
## Warning: The following aesthetics were dropped during statistical transformation: label
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
rd_com1 <- manf_rd %>%
group_by(naics_3digit_label) %>%
mutate(max_rd = max(rd_com_perc)) %>%
ggplot(aes(x = rd_com_perc, y = reorder(naics_3digit_label, max_rd), fill = naics_3digit_label, label = industry_desc)) +
geom_point(aes(), shape = 21) +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "Industry Percent of Total R&D Supply", y = "NAICS 3-digit Sector") +
theme_bw() +
axis_theme
ggplotly(rd_com1)
rd_sup_com <- manf_rd %>%
group_by(naics_3digit_label) %>%
mutate(outlier_label = case_when(
rd_sup_perc > 0.15 ~ industry_desc,
rd_com_perc > .005 ~ industry_desc,
)) %>%
ggplot(aes(x = rd_sup_perc, y = rd_com_perc, fill = naics_3digit_label, label = industry_desc)) +
geom_point(aes(), size = 4, alpha = 0.7, color = "black", shape = 21) +
geom_label_repel(aes(label = outlier_label), size = 4, max.overlaps = 20) +
scale_fill_manual(values = emp_ind_vector) +
guides(fill = "none") +
labs(x = "R&D Percent of Industry Total Supply", y = "Industry Percent of Total R&D Supply", y = "NAICS 3-digit Sector", title = "Industry % of total R&D Supply not always correlated with R&D intensity") +
theme_bw() +
axis_theme
rd_sup_com
## Warning: Removed 208 rows containing missing values (`geom_label_repel()`).
We now turn to non-primary manufacturing outputs. Here, we are particularly interested in the vector of outputs for each industry that describes their non-primary activities.
We can define this vector by eliminating commodities that do not make up at least 5% of total industry supply.
manf_supply_clean <- manf_supply %>%
filter(supply_percent > 0.05) %>%
group_by(industry_desc) %>%
mutate(count = n())
#Create function to make a histogram from a dataframe
df_hist <- function(data, var, title){
data %>%
ungroup() %>%
select({{ var }}, count) %>%
distinct() %>%
select(count) %>%
unlist() %>%
unname() %>%
hist(main = paste("Histogram of", title, sep = " "), xlab = title)
}
With this definition, we see that most industries have 1 output, with a few having 2, and very few having either 3 or 4.
df_hist(manf_supply_clean, industry_desc, "Number of Outputs per Industry")
We can now turn our attention to the vector of other outputs per manufacturing sector.
non_prim_manf <- manf_supply %>%
filter(value_num > 0, supply_percent <= 0.05)
We see that the distribution of production of non-primary components varies across 3-digit sectors (e.g transportation equipment manufacturing seems to have more industries with non-primary production than other 3 digit sectors) as well as within sectors (e.g. Petroleum and coal product manufacturing has high variance in the production of non-primary outputs across industries)
non_prim_ind <- non_prim_manf %>%
ggplot(aes(x = supply_percent, y = naics_3digit_label, fill = as.factor(com_code_3))) +
geom_density_ridges(aes(), alpha = 0.7, jittered_points = TRUE, point_alpha=1,point_shape=21) +
labs(x = "Commodity percent of Total Industry Supply", y = "NAICS 3-digit Sector", fill = "3 digit Commodity Code") +
theme_bw() +
theme(legend.position = "bottom") +
axis_theme
non_prim_ind
## Picking joint bandwidth of 0.00123
We also see that some of these non-primary products nonetheless make up a non-insignificant portion of total commodity supply. For example, petroleum and col products, as well as transportation equipment, and chemical manufacturing all have non-primary outputs that make up more than 20% of total commodity supply.
non_prim_com <- non_prim_manf %>%
ggplot(aes(x = manf_percent, y = naics_3digit_label, fill = as.factor(com_code_3))) +
geom_density_ridges(aes(), alpha = 0.7, jittered_points = TRUE, point_alpha=1,point_shape=21) +
labs(x = "Industry Percent of Total Commodity Supply", y = "NAICS 3-digit Sector", fill = "3 digit Commodity Code") +
theme_bw() +
theme(legend.position = "bottom") +
axis_theme
non_prim_com
## Picking joint bandwidth of 0.00164
non_prim_manf_comp <- non_prim_manf %>%
group_by(naics_3digit_label) %>%
mutate(ind_com_lab = paste(industry_desc, commodity_desc, sep = "by COM: ")) %>%
ggplot(aes(x = supply_percent, y = manf_percent, label = ind_com_lab, fill = as.factor(com_code_3))) +
geom_point(aes(), shape = 21, size = 3) +
# geom_label_repel(aes(label = outlier_label), size = 2, max.overlaps = 20) +
labs(x = "Percent of Industry Total Supply", y = "Industry Percent of Total Supply") +
theme_bw() +
axis_theme
ggplotly(non_prim_manf_comp)
We again see a strong example where commodity production makes up a low percentage of total industry output, but a high percentage of the total available supply of said commodity. As such, it becomes very difficult to determine where a cutoff might be, as even industryXcommodity pairs with relatively low output levels in total may nonetheless produce a meaningful share of a commodity, especially if highly densly clustered together. Because of this, we think about scaling against location quotients, or some alternative measure to capture: out of the total number of establishments that produce this output, how many are located in location x?
There might be an argument to include non-primary outputs that make up less than 5% of total industry supply but account for more than 10% of total commodity output in our definition and categorization of manufacturing sub-sectors. We note that doing this would not substantially increase the number of observations or total products in our initial data set.
manf_supply %>%
filter(supply_percent > 0.05 | manf_percent > .1) %>%
group_by(industry_desc) %>%
mutate(count = n()) %>%
df_hist(industry_desc, "Total Products per Sector")
In general we see that dropping non-primary outputs might drop certain important observations.
Looking across our manufacturing sector database, let us consider the question of how many industries produce specific commodities. We see that almost all manufacturing industries produce “manufactured structures”, “Scientific research and development services” as well as “Custom computer programming services”. Most commodities are manufactured by more than one industry! Plastics, durable goods merchant wholesalers, fabricated metal, machine shops, textile product mills, chemical product and preparation, are other commodities produced by upwards of 50 sub-sectors.
manf_supply %>%
group_by(commodity_desc) %>%
mutate(count = n()) %>%
df_hist(commodity_desc, "Industries Producing a Commodity")
We primarily focus on examining the percent of total supply that each industry produces across the vector of products that they manufacture.
We again return to our clustering methodology
output_clustering <- manf_supply %>%
select(commodity_code, commodity_desc, industry_desc, industry_code, value_num, supply_percent) %>%
pivot_wider(id_cols = c(industry_desc, industry_code), values_from = supply_percent, names_from = commodity_code, values_fill = 0)
output_clustering$ind_code_num <- as.numeric(sub("A.*", "", output_clustering$industry_code))
## Warning: NAs introduced by coercion
We create a 232x255 matrix of 232 manufacturing industries (i), and their output of 255 commodities (c).
output_raw_matrix <- output_clustering %>%
ungroup() %>%
select(-c(industry_code, industry_desc, ind_code_num)) %>%
as.matrix()
In this case, we can use what we learned earlier to skip to our hypothesized best clustering measure. Our goal is to measure the difference in average distance between manufacturing sub-sectors using the NAICS 3-digit level system and our input-based clustering assignment.
out_dist <- as.dist(1-cor(t(output_raw_matrix))) %>% as.matrix()
Our resulting matrix tells us the distance between a given industry and other industries based on the distance between their output vectors.
ind_order <- output_clustering %>%
select(industry_desc, industry_code) %>%
mutate(naics_3digit = as.numeric(str_sub(industry_code, 1, 3))) %>%
left_join(ind_3digit)
## Joining with `by = join_by(naics_3digit)`
ind_order_holder <- ind_order
To calculate the average distance between industries within the same cluster, we must define a distance calculation function. In the distance matrix \(M_{ij}\), both i and j are industries. As such, a given row M[i,] that selects on columns within a specific cluster should summarize the distances between a given industry i and the other industries in its cluster. We will need to perform this for all i in our sample, effectively examining a subset of the initial matrix \(M_{ij}\)
For ease of applying this same process to other clustering algorithm outputs, we create a series of functions to allow us to get the average distance for members within a specific cluster, as well as the distance for each member to the cluster.
## Generic function to get the equivalent of unique(df$col_name) in the tidytext format to allow for piping.
unique_df <- function(data, var){
data %>%
ungroup() %>%
select({{ var }}) %>%
distinct() %>%
unlist() %>%
unname()
}
This function takes in an “odering” dataframe that is organized in the same way as the matrix of distances that we are seeking to average over, the variable that stores clustering assignments, and the matrix of distance measures that we are seeing to average.
#DF[row, column]
get_cluster_distance <- function(data, var, matrix) {
#Make ordering index
order <- data %>%
unique_df({{ var }})
#Make Holder Data Frames
cluster_dist <- data.frame(order, cluster_mean = numeric(length(order)))
ind_clust_dist <- data.frame(industry_code = ind_order$industry_code, sec_cluster_dist = numeric(length(ind_order$industry_code)))
for (i in 1:length(order)) {
#get cluster category
ind_codes <- data %>%
filter( {{ var }} == order[i]) %>%
ungroup() %>%
select(industry_code) %>%
unlist() %>%
unname()
#create an index
ind_rows <- which(data$industry_code %in% ind_codes)
#use that index to get a subset of M_ij
m_ij_sub <- matrix[ind_rows, ind_rows]
#create an average measure of distance for the cluster
cluster_mean <- mean(m_ij_sub)
cluster_dist$cluster_mean[i] <- cluster_mean
#get average distance for each element of the cluster to other elements in the cluster
if (length(m_ij_sub) > 1){
ind_dist <- rowMeans(m_ij_sub) %>% unname()
ind_clust_dist$sec_cluster_dist[ind_rows] <- ind_dist
}
else {
ind_clust_dist$sec_cluster_dist[ind_rows] <- NA_real_
}
}
ind_order_holder <- data %>%
rename( order = {{ var }}) %>%
left_join(cluster_dist) %>%
left_join(ind_clust_dist)
return(ind_order_holder)
}
ind_order_naics <- get_cluster_distance(ind_order, naics_3digit_label, out_dist) %>%
rename(naics_3digit_label = order,
naics_3mean = cluster_mean,
naics_dist = sec_cluster_dist)
## Joining with `by = join_by(order)`
## Joining with `by = join_by(industry_code)`
We now have information about the average distance among the vector of outputs produced by the 6-digit subsectors within a specific NAICS 3-digit category, and the average distance for a given 6-digit subsector to the other entities in its cluster.
Our hypothesis is that this distance will be substantial different (and likely larger) when grouping sub-sectors by common inputs instead of outputs.
As such, we must return to our initial clustering assignments and join this data with our existing dataframe.
ind_order_cluster <- ind_order_naics %>%
left_join(cluster_assignments %>% rename(industry_code = ind_code))
## Joining with `by = join_by(industry_code)`
We then run our function to get cluster distances.
ind_order_final <- ind_order_cluster %>%
mutate(h_label = paste("H_", h_clusters, sep = "")) %>%
get_cluster_distance(h_label, out_dist) %>%
rename(h_label = order,
h_mean = cluster_mean,
h_dist = sec_cluster_dist)
## Joining with `by = join_by(order)`
## Joining with `by = join_by(industry_code)`
We can now measure the distance between sectors within a cluster across our two clustering algorithms. We indeed see that there is a shift in the distribution of distances across clusters. Here, our measure of distance is actually the pearsons correlation coeffecient. Thus, values closer to 1 mean tha they are more related to one another.
dist_comp <- ind_order_final %>%
ungroup() %>%
select(naics_3digit_label, h_label, naics_3mean, h_mean) %>%
distinct() %>%
ggplot() +
geom_density(aes(x = naics_3mean), fill = "blue", alpha = 0.7) +
geom_density(aes(x = h_mean), fill = "purple", alpha = 0.7) +
geom_text(x = 1, y = 1.5, label = "Average Correlation, Input Clustering", color = "purple") +
geom_text(x = .8, y = 6, label = "Average Correlation,\nNAICS (Output) Clustering", color = "blue") +
labs(x = "Average Cluster Correlation", y = "Observations") +
theme_bw() +
axis_theme
dist_comp
naics_clusters <- ind_order_final %>%
ggplot() +
geom_density_ridges(aes(x = naics_dist, y = reorder(naics_3digit_label, naics_3mean), fill = naics_3mean)) +
labs(x = "Distribution of Distance between Sub-sectors and Their Clusters", y = "NAICS 3-Digit Codes") +
scale_fill_distiller(palette = "Spectral") +
guides(fill = "none") +
theme_bw() +
axis_theme
h_dist <- ind_order_final %>%
distinct() %>%
ggplot() +
geom_density_ridges(aes(x = h_dist, y = reorder(h_label, h_mean), fill = h_mean)) +
labs(x = "Distribution of Distance between Sub-sectors and Their Clusters", y = "H-Clustering") +
scale_fill_distiller(palette = "Spectral") +
guides(fill = "none") +
theme_bw() +
axis_theme
If we compare the distribution of distances between sub-sectors across our two clustering algorithms, we see that while the NAICS 3-digit clustering has more very high-correlation sub-sectors, the drop off is faster and steeper for other sub-sectors, while the input-based clustering algorithm has a less steep drop off, and more clusters in the middle.
dist_comp <- naics_clusters / h_dist
dist_comp
## Picking joint bandwidth of 0.025
## Warning: Removed 2 rows containing non-finite values (`stat_density_ridges()`).
## Picking joint bandwidth of 0.0243
## Warning: Removed 1 rows containing non-finite values (`stat_density_ridges()`).
We now turn to a crucial question of our analysis: which industries produce inputs that manufacturing industries demand? In this part of our analysis, contrary to our previous work, we expand the scope of our analysis to consider all industries once again. From before, we remember that non-manufacturing sectors sometimes still use more than 50% of their inputs as manufactured inputs. Thus, when considering co-location, we must account for manufacturing co-location near end users. However, we might want to distinguish between vertical agglomeration within the manufacturing and material production sectors (e.g. supplier networks), compared to agglomeration that occurs outside of the manufacturing sector.
Again, a critical point of this work is to highlight the heterogeneity in linkages between sectors, as well as regional heterogeneity in these linkages. We argue that different types of agglomeration effects have different impacts, and as such, require different policy tools to cultivate and develop.
manf_input_sum
How should we look at production of inputs? Share of total input requirements that are produced? distance between vectors of outputs and vectors of inputs?
CRITICAL QUESTION: to what extent do we focus on networks within manufacturing v. flows outside of manufacturing? Maybe as a robustness test?
FInal eq: State quadrants + composition + avg_change_composition + …?